The riparian plot is the primary method to visualize syntenic relationships among >2 species. GENESPACE offers this functionality and considerable customization opportunities. Here, we will illustrate some of these possibilities.
Typically, the user will run the GENESPACE pipeline via
run_genespace(gsParam = x), then call the function
plot_riparian(gsParam = x) where x is the GENESPACE
parameter list. This method has the benefit of using the pre-checked
file paths stored in x and can thus calculate and access
the block coordinates without any user specifications. Since
plot_riparian() requires access to the ‘syntenic hits’ text
files, which can be very large, this tutorial does not include the
source data to repeat the exact analyses. Instead users will have to
construct the entire GENESPACE run from the scripts included below. The
full run takes about 20 minutes to complete.
library(GENESPACE)
## GENESPACE v1.1.10: synteny and orthology constrained comparative
## genomics
NOTE This section is copied from the primary GENESPACE guide. If you have already run that, no need to do this part.
Set the paths for your run. Change these paths to those valid on your system
genomeRepo <- "~/path/to/store/rawGenomes"
wd <- "~/path/to/genespace/workingDirectory"
path2mcscanx <- "~/path/to/MCScanX/"
Download the data of interest. Here we are going to use human, mouse, platypus, chicken, and sand lizard. To do this programatically, get the URLs to the genome annotations, but leave off the ‘genomic.gff.gz’ string at the end, then make the full paths by appending ‘translated_cds.faa.gz’ and ‘genomic.gff.gz’ to the ends of the paths. Finally make the directories to write to and download the files.
urls <- c(
human ="000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_",
mouse = "000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_",
platypus = "004/115/215/GCF_004115215.2_mOrnAna1.pri.v4/GCF_004115215.2_mOrnAna1.pri.v4_",
chicken = "016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_",
sandLizard = "009/819/535/GCF_009819535.1_rLacAgi1.pri/GCF_009819535.1_rLacAgi1.pri_")
genomes2run <- names(urls)
urls <- file.path("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF", urls)
translatedCDS <- sprintf("%stranslated_cds.faa.gz", urls)
geneGff <- sprintf("%sgenomic.gff.gz", urls)
names(translatedCDS) <- genomes2run
names(geneGff) <- genomes2run
writeDirs <- file.path(genomeRepo, genomes2run)
names(writeDirs) <- genomes2run
for(i in genomes2run){
print(i)
if(!dir.exists(writeDirs[i]))
dir.create(writeDirs[i])
download.file(
url = geneGff[i],
destfile = file.path(writeDirs[i], basename(geneGff[i])))
download.file(
url = translatedCDS[i],
destfile = file.path(writeDirs[i], basename(translatedCDS[i])))
}
Parse the annotations to fastas with headers that match a gene bed file
genomes2run <- c("human", "mouse", "platypus", "chicken", "sandLizard")
parsedPaths <- parse_annotations(
rawGenomeRepo = genomeRepo,
genomeDirs = genomes2run,
genomeIDs = genomes2run,
presets = "ncbi",
genespaceWd = wd)
Conduct the GENESPACE run … depending on your machine, this can take a few minutes up to an hour.
gpar <- init_genespace(
wd = wd,
ploidy = 1,
path2mcscanx = path2mcscanx)
out <- run_genespace(gpar, overwrite = T)
## Loading objects:
## gsParam
Here is what the default plot looks like using the human genome as the reference. Each syntenic block, phased and colored by the human reference genome chromosomes, are visualized as braids. Each chromosome is a rounded rectangle polygon.
ripDat <- plot_riparian(
gsParam = out,
refGenome = "human",
forceRecalcBlocks = FALSE)
The forceRecalcBlocks argument (default = TRUE) ignores
any existing reference-phased block coordiates and re-calculates them
according to the specifications in the plot_riparian
function call. In general, this is a good idea to ensure that your
parameters are being respected; however, it can be much faster to not
recalculate block coordinates, especially in large runs.
For a full list of all the parameters, see the end of this tutorial.
Instead of aggregated syntenic regions, we can plot just the fully
collinear syntenic blocks using useRegions = FALSE. We can
also use physical position instead of gene rank order with
useOrder = FALSE.
ripDat <- plot_riparian(
out,
refGenome = "human",
useOrder = FALSE,
useRegions = FALSE)
By default, the genomes are ordered following the positions in the
species tree from orthofinder. We can adjust this using the
genomeIDs parameter. We can also switch up which genome is
used to phase the blocks with refGenome.
ripDat <- plot_riparian(
gsParam = out,
refGenome = "mouse",
genomeIDs = c("mouse", "human", "platypus", "chicken"),
forceRecalcBlocks = FALSE)
By default, syntenyWeight = 0.5, which means that
chromsomes are re-ordered across the genomes to maximize synteny while
also preserving local ordering of chromosomes by the order of chromosome
names. For example, if both Chr1 and 2 are at similar syntenic
positions, but Chr2 is slightly more to the left than Chr1, Chr1 will
still come first in the synteny plot. If syntenyWeight = 1,
only synteny is considered and names are ignored. If
syntenyWeight = 0 (equivalent to
reorderBySynteny = FALSE), only chromosome names are used
and synteny is ignored.
ripDat <- plot_riparian(
gsParam = out,
#reorderBySynteny = FALSE,
syntenyWeight = 0,
refGenome = "human")
We can also change how the chromosomes are ordered using two
approaches. Either using a custom order for the reference genome with
customRefChrOrder.
ripDat <- plot_riparian(
gsParam = out,
refGenome = "human",
customRefChrOrder = c("X", 1:22))
Or alternatively, by supplying a function that re-orders the all the
chromosomes. By default, refChrOrdFun orders the reference
genome chromosomes so that first chromosomes with smaller numeric values
are on the left, then break the ties by chromosomes that come earlier in
the alphabet are to the left, and if still tied, choose a random one.
But we can mix this up. For example, any chromosomes that are not
numbered by an integer go first, then the integers:
ordFun <- function(x)
data.table::frank(
list(!grepl("[a-zA-Z]", x),
as.numeric(gsub("[a-zA-Z]", "", x))),
ties.method = "random")
ripDat <- plot_riparian(
gsParam = out,
refChrOrdFun = ordFun,
refGenome = "human",
reorderBySynteny = FALSE)
By default,
plot_riparian(minChrLen2plot = ifelse(useOrder, 100, 1e5))
excludes any scaffolds that have < 100 genes (is using gene rank
order) or < 100kb (if using physical position). We can plot all
chromosomes by setting this to 0, or only plot larger chromosomes by
increasing this value. For example, show even the tiny little
scaffolds:
ripDat <- plot_riparian(
gsParam = out,
minChrLen2plot = 0)
In some cases (e.g. closely related genomes), it may be desirable to
highlight where inverted sequences exist. We can do this with
inversionColor, which is set to NULL by default. If given a
color, this will highlight inverted blocks.
ripDat <- plot_riparian(
gsParam = out,
genomeIDs = c("chicken", "sandLizard"),
refGenome = "chicken",
inversionColor = "green")
Depending on the differences in genome sizes, we can also adjust how
the gaps between chromosomes are treated. The two parameters that do
this are gapProp, which controls the gap size between
chromosomes in the largest genome, and scaleGapSize, which
controls how large the gaps are for smaller genomes. Here, we double the
gap size and double the amount of spread between other genomes. This
way, chromosomes are more evenly distributed from top to bottom in the
plot.
To improve clarity, we can also add some spacing between the ends of
the braids. These normally end right at the middle of the chromosome
polygons. However, we can bring them out further with
scaleBraidGap, which adds a gap proportional to the height
of the chromosome polygons. This probably should be used in conjunction
with larger gaps via gapProp (to make the breaks more
obvious). Its also sometimes better to make the chromosome polygons less
rounded and more like a rectangle so the bounds are clearer when the
braids don’t go under the chromosome polygons.
ripDat <- plot_riparian(
gsParam = out,
gapProp = 0.02,
scaleBraidGap = 2,
scaleGapSize = .5,
howSquare = 2,
refGenome = "human")
In some cases, it can improve clarity by inverting certain
chromosomes. For example, the mouse X chromosome is somewhat reversed in
order relative to platypus and humans. Lets invert the order of that one
and a few others to illustrate how this works. Inverted chromosomes get
a * flag next to their names. We can also plot more scaffolds with few
genes on them by decreasing minChrLen2plot, which has a
default of 100 genes or 100kb depending on if you are using gene order
or not. In some cases, it may be benefical to only show the chromosome
IDs for certain genomes (like if all of the genomes are highly
syntenic). You can choose which genomes to be labeled with
labelTheseGenomes. The unlabeled chromosomes get smaller in
this case to make it easier to view. We can change size of the text with
chrLabFontSize.
invchr <- data.frame(
genome = c("mouse", "mouse", "mouse", "mouse", "chicken"),
chr = c(4, 3, 1, "X", 5))
ripDat <- plot_riparian(
gsParam = out,
minChrLen2plot = 10,
invertTheseChrs = invchr,
refGenome = "human",
chrLabFontSize = 7,
labelTheseGenomes = c("human", "mouse", "chicken"))
Genomes with low levels of synteny, like the case here can be hard to
visualize. It may be helpful to adjust the coloring and transparency of
the braids. This can be accomplished with palette and
braidAlpha. We can also make a dark palette to put on a
white background andchange the color of the chromosomes using
chrFill …
ggthemes <- ggplot2::theme(
panel.background = ggplot2::element_rect(fill = "white"))
customPal <- colorRampPalette(
c("darkorange", "skyblue", "darkblue", "purple", "darkred", "salmon"))
ripDat <- plot_riparian(
gsParam = out,
palette = customPal,
braidAlpha = .75,
chrFill = "lightgrey",
addThemes = ggthemes,
refGenome = "human")
We can also use the riparian plot to highlight specific regions of interest (e.g. Fig. 2 here). Lets repeat this analysis with the five test genomes here. To mimic figure 2, we have to add three other customizations: use human as the reference genome, order the human chromosomes as X, 1-22, and order the genomes with sandLizard at the bottom.
roi <- data.frame(
genome = c("human", "chicken"),
chr = c("X", "Z"),
color = c("#FAAA1D", "#17B5C5"))
ripDat <- plot_riparian(
gsParam = out,
highlightBed = roi,
refGenome = "human",
genomeIDs = c("sandLizard", "chicken", "human", "mouse", "platypus"),
customRefChrOrder = c("X", 1:22))
You can also only plot the regions of interest without the background
synteny by setting backgroundColor to NULL.
ripDat <- plot_riparian(
gsParam = out,
highlightBed = roi,
backgroundColor = NULL,
genomeIDs = c("sandLizard", "chicken", "human", "mouse", "platypus"),
refGenome = "human",
customRefChrOrder = c("X", 1:22))
plot_riparian(...) defaults to the following basic
parameters:
useOrder = TRUE: the synteny map and chromosome sizes
reflect gene rank order (and not physical basepair position)useRegions = TRUE: plot syntenic blocks aggregated by
gsParam$params$blkBuffer radius. This produces less complex
synteny maps, but can sometimes yield regions that appear to be
overlapping when they are not.genomeIDs = gsParam$genomeIDs: plot all of the genomes
that were used in the GENESPACE runreorderBySynteny = TRUE: order chromosomes in all
genomes so that they maximize synteny to a reference genomesyntenyWeight = 0.5: either 0, .5 or 1, specifying
whether synteny should be ignored (equivalent to
reorderBySynteny = FALSE), partially weighted or fully
weighted.refGenome = NULL (see reorderBySynteny):
no reference genome specified means that the first haploid (ploidy = 1)
genome in the gsParam$genomeIDs vector is usedrefChrOrdFun = frank(list(as.numeric(gsub("\\D+", "", x)), x), ties.method = "random"):
order the reference genome chromosomes so that first chromosomes with
smaller numeric values are on the left, then break the ties by
chromosomes that come earlier in the alphabet are to the left, and if
still tied, choose a random one.customRefChrOrder = NULL: the reference genome
chromosomes are ordered by these chromosome IDs instead of
refChrOrdFun.invertTheseChrs = NULL: keep gene order along a
chromosome identical to that in the bed filesminChrLen2plot = ifelse(useOrder, 100, 1e5): since
using gene order only show chromosomes that contain > 100 genes.braidAlpha = 0.5: 50% transparency of the synteny
mapscaleBraidGap = 0: the braids should end exactly at the
middle of the chromosome polygons.palette = gs_colors: use the gs_colors R
color palette, which goes from red to pink without any greens.gapProp = 0.005: The gap size between chromosomes of
the largest genome are 0.5% of the entire genome length.howSquare = 0: use a perfect circle on the ends of the
chromosome polygonschrExpand = 1: there should be a buffer 1x the hieght
of the chromosome label text.chrFill = "white: Chromosome polygons are white.labelTheseGenomes = gsParam$genomeIDs: all chromosomes
in all genomes are labeledchrLabBorderLwd = 0.5: use lwd = 0.5 for the borders of
the chromosomes.inversionColor = NULL: if not NULL, color the
inversions differently by this value.chrLabFun = function(x) gsub("^0", "", gsub("chr|scaf|chromosome|scaffold|^lg|_", "", tolower(x))):
strip some common strings off of the chromosome namesxlabel = sprintf("Chromosomes scaled by %s", ifelse(useOrder, "gene rank order", "physical position")):
since default is to useOrder, the x-axis label is “Chromosomes scaled by
gene rank order”.chrLabFontSize = 5: the size of the chromosome
labels.pdfFile = NULL: do not write the plot to file, just
print to the current graphic device.forceRecalcBlocks = TRUE: should the blocks be
re-calculated, even if a phasedBlk file exists?scalePlotHeight = 1: scale plot height by the number of
genomesscalePlotWidth = 1: use 8” wide plotting windowaddThemes = NULL: do not pass any additional custom
ggplot2 themes to the plothighlightBed = NULL do not highlight specific
regionsbackgroundColor = NULL: not used since
highlightBed = NULLblk = NULL: not used unless you have a specific region
to plot.